3 research outputs found

    Application of Machine Learning in Cancer Research

    This dissertation revisits the problem of five-year survivability prediction for breast cancer using machine learning tools. The work is distinguished from past experiments by the size of the training data, the unbalanced distribution of data between the minority and majority classes, and modified data cleaning procedures. The experiments also follow the principles of tidy data and reproducible research. To fine-tune the predictions, a set of experiments was run using naive Bayes, decision trees, and logistic regression. Of particular interest were strategies to improve recall for the minority class, as the cost of misclassification is prohibitive. One of the main contributions of this work is the finding that logistic regression with the proper predictors and class weight gives the highest precision/recall level for the minority class. In regression modeling with a large number of predictors, correlation among predictors is quite common, and the estimated model coefficients might not be very reliable. In these situations, the Variance Inflation Factor (VIF) and the Generalized Variance Inflation Factor (GVIF) are used to overcome the correlation problem. Our experiments are based on the Surveillance, Epidemiology, and End Results (SEER) database for the problem of survivability prediction. Some of the specific contributions of this thesis are:
    · A detailed process for data cleaning and binary classification of 338,596 breast cancer patients.
    · A computational approach for omitting predictors and categorical predictors based on VIF and GVIF.
    · Various applications of the Synthetic Minority Over-sampling Technique (SMOTE) to increase precision and recall.
    · An application of Edited Nearest Neighbor to obtain the highest F1-measure.
    In addition, this work provides precise algorithms and code for determining class membership and executing competing methods. This code can facilitate the reproduction and extension of our work by other researchers.
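To make the VIF-based omission of predictors concrete, here is a minimal sketch in pure Python. It is not the dissertation's code: the function names, the toy data, and the cutoff of 10 (a common rule of thumb) are all assumptions made for illustration. It uses the standard identity that the VIF of predictor j equals the j-th diagonal entry of the inverse of the predictors' correlation matrix, and it handles numeric predictors only; GVIF for categorical predictors is not covered.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equal-length numeric columns."""
    n = len(columns[0])
    scaled = []
    for col in columns:
        mean = sum(col) / n
        dev = [x - mean for x in col]
        norm = sum(d * d for d in dev) ** 0.5
        scaled.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in scaled] for ci in scaled]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        p = aug[c][c]
        aug[c] = [v / p for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


def drop_collinear(data, threshold=10.0):
    """Drop the predictor with the largest VIF until every VIF is <= threshold.

    `data` maps a (hypothetical) predictor name to its column of values.
    The threshold of 10 is an assumed rule of thumb, not a value from the thesis.
    """
    kept = dict(data)
    while len(kept) > 1:
        names = list(kept)
        scores = vif([kept[name] for name in names])
        worst = max(range(len(names)), key=lambda j: scores[j])
        if scores[worst] <= threshold:
            break
        del kept[names[worst]]
    return list(kept)
```

Calling `drop_collinear({"x1": col1, "x2": col2, "x3": col3})` on columns where `x3` is nearly the sum of `x1` and `x2` removes one of the three and returns the names of the remaining two. Dropping one predictor at a time and recomputing is a deliberate choice, since removing a single collinear predictor changes the VIF of all the others.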

    Privacy-preserving Data Mining on Hospitality Big Data

    In this paper we present a summary of our work on the security and encryption of hotel big data. We give a brief introduction to some techniques for securing large-scale data sets, and then look into homomorphic encryption and the Map Reduce environment. With advances in computer architecture and silicon technology, processing large data sets became possible once some fundamental data storage and processing algorithms had been proposed and implemented. Analyzing big data has opened many opportunities for scientists in different research and application areas. The hospitality industry, for example, collects and keeps customer information, which poses some significant challenges. The first challenge is how to store and maintain these massive data sets; the second is securing this sensitive information. In this paper we discuss the Parallel Homomorphic Encryption (PHE) security method, which can be used in companies' information storage, processing, and management.
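The abstract does not say which homomorphic scheme PHE builds on, so the following sketch only illustrates the general idea with a toy Paillier cryptosystem, which is an assumption chosen for illustration. Each record can be encrypted independently (a map step), and the ciphertexts can be combined into an encrypted sum without decrypting anything (a reduce step). The function names, the tiny primes, and the sample values are hypothetical, and keys this small offer no real security.

```python
import math
import random
from functools import reduce


def keygen(p=1789, q=1861):
    """Toy Paillier key pair from two small primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; requires Python 3.8+
    return n, (lam, mu)


def encrypt(n, m, rng=random):
    """Encrypt integer m (0 <= m < n) with generator g = n + 1."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    # (n + 1)^m mod n^2 equals 1 + m*n, so no exponentiation is needed for g.
    return (1 + m * n) * pow(r, n, n2) % n2


def decrypt(n, private_key, c):
    """Recover the plaintext from ciphertext c."""
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts modulo n."""
    return c1 * c2 % (n * n)


if __name__ == "__main__":
    n, private_key = keygen()
    charges = [120, 85, 240]                          # made-up sample values
    ciphertexts = [encrypt(n, m) for m in charges]    # "map": independent per record
    total = reduce(lambda a, b: add_encrypted(n, a, b), ciphertexts)  # "reduce"
    print(decrypt(n, private_key, total))
```

Because each call to `encrypt` depends only on its own record, and `add_encrypted` is associative and commutative, both steps can be split across workers in any order, which is the property a Map Reduce deployment would rely on.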

    Data Analysis With Map Reduce Programming Paradigm

    In this thesis, we present a summary of our activities associated with the storage and query processing of the Google 1T 5-gram data set. We first give a brief introduction to some of the implementation techniques for relational algebra, followed by a Map Reduce implementation of the same operators. We then implement a database schema in Hive for the Google 1T 5-gram data set. The thesis further looks into query processing with Hive and Pig in the Hadoop setting. More specifically, we report statistics for our queries in this environment.
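As a rough illustration of how relational algebra operators map onto the Map Reduce paradigm, the sketch below simulates the map, shuffle, and reduce phases in memory. It is not the thesis implementation and does not use Hadoop, Hive, or Pig; all function names and the tuple layouts are assumptions made for this example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to each record, group emitted pairs by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):  # sorting stands in for the shuffle phase
        output.extend(reducer(key, groups[key]))
    return output


def select(rows, predicate):
    """Selection: keep rows that satisfy the predicate (set semantics)."""
    return map_reduce(rows,
                      lambda row: [(row, None)] if predicate(row) else [],
                      lambda key, values: [key])


def project(rows, indexes):
    """Projection: keep the given attribute positions and remove duplicates."""
    return map_reduce(rows,
                      lambda row: [(tuple(row[i] for i in indexes), None)],
                      lambda key, values: [key])


def join(left, right):
    """Natural join of left(a, b) with right(b, c) on the shared attribute b."""
    tagged = [("L", row) for row in left] + [("R", row) for row in right]

    def mapper(item):
        tag, row = item
        if tag == "L":
            return [(row[1], (tag, row[0]))]
        return [(row[0], (tag, row[1]))]

    def reducer(key, values):
        lefts = [v for tag, v in values if tag == "L"]
        rights = [v for tag, v in values if tag == "R"]
        return [(a, key, c) for a in lefts for c in rights]

    return map_reduce(tagged, mapper, reducer)


def sum_by_key(rows, key_index, value_index):
    """Grouping and aggregation: sum one attribute per distinct key."""
    return map_reduce(rows,
                      lambda row: [(row[key_index], row[value_index])],
                      lambda key, values: [(key, sum(values))])
```

For example, with rows shaped like `(token, count)`, `sum_by_key(rows, 0, 1)` totals the counts per token, which is the same pattern a Hive `GROUP BY` query compiles down to. The join tags each tuple with its source relation so the reducer can pair left and right tuples that share a key, a technique usually called a reduce-side join.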